This project is a quick exploratory data analysis in R. The aim is to explore the relationships between the features of a data set.
We will be using a wine data set available at http://www3.dsi.uminho.pt/pcortez/wine/.
‘White Wine Quality’ is a tidy dataset which contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
We will use univariate, bivariate, and multivariate analyses to explore the relationships between the data features and to see which of them are associated with the quality rating. You can read the final summary and reflection at the end of this document.
Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009
## [1] 4898 12
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
Dataset dimensions: There are 12 variables (the 11 chemical properties plus the quality rating) and a total of 4,898 observations.
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
Dataset structure: All of the variables are of type num or int; there are no factor variables.
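The dimension, name, and structure output above can be reproduced along the following lines. This is a minimal sketch: it rebuilds only the first five rows printed in this report as a stand-in for the full data set, and the file name and separator in the commented-out `read.csv()` line are assumptions, not taken from this report.

```r
# Stand-in data: the first five rows printed in this report.
wine <- data.frame(
  fixed.acidity        = c(7.0, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity     = c(0.27, 0.30, 0.28, 0.23, 0.23),
  citric.acid          = c(0.36, 0.34, 0.40, 0.32, 0.32),
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5),
  chlorides            = c(0.045, 0.049, 0.050, 0.058, 0.058),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186),
  density              = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956),
  pH                   = c(3.00, 3.30, 3.26, 3.19, 3.19),
  sulphates            = c(0.45, 0.49, 0.44, 0.40, 0.40),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality              = c(6L, 6L, 6L, 6L, 6L)
)

# With the real file, a call like this would replace the block above.
# ASSUMPTION: the file name and the ";" separator are hypothetical.
# wine <- read.csv("winequality-white.csv", sep = ";")

dim(wine)    # number of rows, then number of columns
names(wine)  # column names
str(wine)    # type and first values of each column
```

On the stand-in data `dim()` reports 5 rows rather than 4,898, but the 12 columns and their types match the output shown above.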
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.0 0.27 0.36 20.7 0.045
## 2 6.3 0.30 0.34 1.6 0.049
## 3 8.1 0.28 0.40 6.9 0.050
## 4 7.2 0.23 0.32 8.5 0.058
## 5 7.2 0.23 0.32 8.5 0.058
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
Dataset features: We will review quality, alcohol, sulphates, density, and residual sugar.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
Quality values range from 3 to 9. The mean is 5.88 and the median is 6.00.
Quality has a roughly normal, bell-shaped distribution. The most common score is 6 (44.88% of wines), and only a small number of wines scored 9 (0.1%).
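Percentages like the ones quoted above can be derived from a frequency table. The sketch below assumes the quality scores are held in an integer vector; it uses only the five quality values printed earlier in this report (all equal to 6), so the shares it produces are not the ones from the full data set.

```r
# Stand-in data: quality scores of the first five wines in this report.
quality <- c(6L, 6L, 6L, 6L, 6L)

# Fix the levels to 3..9 (the observed range) so empty scores show up as 0.
counts <- table(factor(quality, levels = 3:9))

# Convert counts to percentages of all wines, rounded to two decimals.
share <- round(100 * prop.table(counts), 2)

counts
share
```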
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Alcohol values range from 8.00 to 14.20. The average alcohol value is 10.51.
The largest group by frequency has an alcohol content of 9-9.5. There is a high concentration of wines with an alcohol content between 8.5 and 12.5.
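One way to find the largest group by frequency is to bin the values and count each bin. The sketch below uses base R `hist()` with a bin width of 0.5; both the bin width and the break points are assumptions, and the input is only the five alcohol values printed earlier in this report.

```r
# Stand-in data: alcohol values of the first five wines in this report.
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9)

# Bin into 0.5-wide intervals without drawing the plot.
# ASSUMPTION: breaks from 8 to 14.5 cover the range reported in the summary.
h <- hist(alcohol, breaks = seq(8, 14.5, by = 0.5), plot = FALSE)

# One row per bin: where it starts and how many wines fall into it.
bins <- data.frame(bin_start = head(h$breaks, -1), count = h$counts)
bins[bins$count > 0, ]
```

Bins are closed on the right by default, so a value of exactly 9.5 is counted in the 9-9.5 bin.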
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
Sulphate values range from 0.22 to 1.08. The average sulphate value is 0.49.
The majority of wines have a sulphate content between 0.3 and 0.6.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
White wine density falls between 0.987 and 1.039, with a mean of 0.994.
The histogram is bunched at the left of the axis with a long right tail, which suggests that the density variable has at least one high outlier.
Density values are closely grouped for the majority of wines.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
The middle half of the wines have a residual sugar content between 1.7 and 9.9. There are large outliers at the high end: the maximum is 65.8.
As with density, the histogram of residual sugar is bunched at the left with a long right tail, which suggests that residual sugar contains high outliers. Let’s exclude the top 1% of residual sugar values…
After trimming, wines have a residual sugar content of roughly 1-19.
Here we can see that the outliers at the high end lie far beyond the majority of the values.
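Excluding the top 1% of a variable can be sketched with `quantile()`. This report does not show how the trimming was actually done (it may equally have been an axis limit on the plot), so the filter below is an assumption, applied to the five residual sugar values printed earlier in this report.

```r
# Stand-in data: residual sugar of the first five wines in this report.
residual.sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5)

# 99th percentile of the variable; values above it count as the top 1%.
cutoff <- quantile(residual.sugar, probs = 0.99)

# Keep only the values at or below the cutoff.
trimmed <- residual.sugar[residual.sugar <= cutoff]

summary(trimmed)
```

On a full data frame the same cutoff could be used with `subset()` so that every column is filtered together.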
Density and alcohol have the strongest relationship. Alcohol has a negative relationship with density.
##
## Pearson's product-moment correlation
##
## data: alcohol and quality
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4126015 0.4579941
## sample estimates:
## cor
## 0.4355747
Alcohol has a positive relationship (r = 0.44) with quality.
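Output in the format shown above comes from `cor.test()`. The sketch below is a stand-in, not the original call: quality is constant in the five rows printed earlier in this report, which makes its correlation undefined, so the example correlates alcohol with density instead.

```r
# Stand-in data: first five wines in this report.
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9)
density <- c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956)

# Pearson's product-moment correlation with a two-sided test.
result <- cor.test(alcohol, density, method = "pearson")

result$estimate   # the correlation coefficient (cor)
result$parameter  # degrees of freedom, n - 2
result$conf.int   # 95 percent confidence interval
```

With only five rows the confidence interval is very wide, so the sign of the estimate is the only part worth reading here.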
Now let’s remove the outliers.
Generally, higher quality wines have a higher alcohol content.
Generally, as quality increases, sulphate content increases…
Now let’s remove the outliers.
Generally, as quality increases, sulphate content increases… until quality scores reach 7.5.
##
## Pearson's product-moment correlation
##
## data: sulphates and alcohol
## t = 0.35834, df = 1559, p-value = 0.7201
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.04055754 0.05866309
## sample estimates:
## cor
## 0.009075112
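Alcohol and sulphates have essentially no linear relationship (r ≈ 0.009, p = 0.72). Note that df = 1559 means this test ran on 1,561 observations rather than the full 4,898.
Density and residual sugar values are both concentrated in a narrow range at the low end of their scales. Generally, the higher the density, the higher the residual sugar content.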
## Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...
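The `Ord.factor` line above shows quality after conversion to an ordered factor. A minimal sketch of that conversion follows; the variable name `quality.f` is hypothetical, and the levels are set explicitly because the five stand-in values from this report contain only the score 6.

```r
# Stand-in data: quality scores of the first five wines in this report.
quality <- c(6L, 6L, 6L, 6L, 6L)

# Ordered factor with one level per score in the observed range 3..9.
# On the full data set all seven scores occur, so `levels` could be omitted.
quality.f <- factor(quality, levels = 3:9, ordered = TRUE)

str(quality.f)         # structure: ordered factor with 7 levels
as.integer(quality.f)  # internal codes: score 6 is the 4th level
```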
Alcohol has a positive relationship with quality.
Generally, higher quality wines have a higher alcohol content. Alcohol has a negative relationship with density.
Findings: Alcohol has a positive relationship (r = 0.44) with quality. Generally, higher quality wines have a higher alcohol content. Alcohol has a negative relationship with density. Wine density has a negative relationship with quality: wines with a higher score tend to have lower density.
Process: We can see that there are outliers in the density variable. We will remove them in the final plots section. We will also add a darker background for contrast. This more clearly highlights the difference between good and bad wines (because a neutral color is used for the OK wines), and the levels of quality are highlighted by the gradation in color.
Because quality is ordinal, we set ggplot() to use sequential or diverging color encoding, which gives a sense of gradation across the levels in the data, rather than qualitative color encoding, which is meant for unordered discrete variables.
Sequential color encoding suits purely ordinal discrete data, while diverging color encoding suits data that is ordinal and also follows a diverging scale (think “Good, Ok, Bad”, which can be viewed as appropriate for this dataset). In the plot above, “RdYlBu” is a specific diverging color scheme, the name option changes the legend title, and direction = -1 reverses the order of the colors.
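The color encoding described above could look roughly like the sketch below. It is an illustration under assumptions, not the original plotting code: the choice of axes, the point transparency, the legend title "Quality", and the `quality.f` column name are all hypothetical, and the data are the five rows printed earlier in this report.

```r
library(ggplot2)

# Stand-in data: first five wines in this report.
wine <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9),
  density = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956),
  quality = c(6L, 6L, 6L, 6L, 6L)
)

# Ordered factor so that ggplot2 treats quality as discrete and ordinal.
wine$quality.f <- factor(wine$quality, levels = 3:9, ordered = TRUE)

p <- ggplot(wine, aes(x = alcohol, y = density, color = quality.f)) +
  geom_point(alpha = 0.5) +
  # Diverging ColorBrewer palette; direction = -1 reverses the color order.
  scale_color_brewer(type = "div", palette = "RdYlBu",
                     name = "Quality", direction = -1) +
  # Darker background for contrast with the pale middle colors.
  theme_dark()

p
```

A diverging palette puts a pale, neutral color in the middle of the scale, which is why the mid-range scores recede while low and high scores stand out against the dark theme.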
Plot title: "Relationships & Correlations"
Alcohol and density have the strongest relationship; the two are negatively related. Density and alcohol are the strongest determinants of quality.
The amount of residual sugar has a weak relationship with the quality of wine.
There’s a high concentration of wines with a quality of 5-7, and these largely have alcohol levels between 8.5 and 13.5.
Plot title: "Relationship & distribution of alcohol, density, and quality"
Findings: Alcohol has a positive relationship (r = 0.44) with quality. Generally, higher quality wines have a higher alcohol content. Alcohol has a negative relationship with density. The density of wine has a negative relationship with the quality of wine.
Process: We removed the outliers in the density variable and added a darker background for contrast. This more clearly highlights the difference between good and bad wines (because a neutral color is used for the OK wines), and the levels of quality are highlighted by the gradation in color.
EDA results: We reviewed five features in this data set: quality, alcohol, density, sulphates, and residual sugar.
1) Density is the best predictor of quality. Higher quality wines tend to have lower density and higher alcohol content.
2) Alcohol has a positive relationship (r = 0.44) with quality. Generally, higher quality wines have a higher alcohol content.
3) Sulphates and alcohol content generally do not have a strong relationship.
4) The amount of residual sugar has a weak relationship with the quality of wine. Overall, higher quality wines tend to have a low residual sugar content. Generally, as density increases, residual sugar also increases.
Process reflection: Here we used plotting (scatterplots, histograms, boxplots, and line graphs).
In the future, I would like to build a correlation matrix using all of the variables. I would also like to review ratios such as quality / alcohol and to do a deeper dive into alcohol and density. Lastly, I would like to create a model that can predict wine quality from these observations with some degree of confidence. Overall, I would like more practice with transparency, jitter, smoothing, and limiting axes.